In May 2020, the Georgia Department of Public Health posted the following plot to illustrate the number of confirmed COVID-19 cases in the state's hardest-hit counties over a two-week period. Health officials claimed that the plot showed COVID-19 cases decreasing and used it to argue for reopening the state.
The plot was heavily criticized by the statistical community and several media outlets for its deceptive portrayal of COVID-19 trends in Georgia. Whether the end result was due to malicious intent or simply poor judgment, it is incredibly irresponsible to publish data visualizations that obscure and distort the truth.
Data visualization is a powerful tool that can affect health policy decisions. Ensuring that visualizations are easy to interpret and, more importantly, convey accurate insights from data is paramount for scientific transparency and the health of individuals. For this assignment, you are tasked with reproducing COVID-19 visualizations and tables published by the New York Times. Specifically, you will attempt to reproduce the items listed in the tasks at the end of this document, as they appeared on January 12th, 2022.
Data for cases and deaths can be downloaded from this NYT GitHub repository (use us-counties.csv). Data for hospitalizations can be downloaded from The COVID Tracking Project. The project must be submitted in the form of a Jupyter notebook or RMarkdown file and corresponding compiled/knitted PDF, with commented code and text interspersed, including a brief critique of the reproducibility of each plot and table. All project documents must be uploaded to a GitHub repository each student will create within the reproducible data science organization. The repository must also include a README file describing the contents of the repository and how to reproduce all results. You should keep in mind the file and folder structure we covered in class and make the reproducible process as automated as possible.
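One way to automate the data step is to download each raw file only when no local copy exists, so that re-running the notebook does not silently change its inputs. The following is a minimal sketch in base R, not a required approach. The helper name `fetch_if_missing`, the placeholder URL, and the `data/raw` folder are all assumptions made for illustration; substitute the actual source URL and the folder layout covered in class.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is missing.
# Returns the local path invisibly either way.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the target folder if needed, then download in binary mode.
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Example usage (placeholder URL and path, both assumptions):
# data_path <- fetch_if_missing("https://example.org/us-counties.csv",
#                               file.path("data", "raw", "us-counties.csv"))
# counties <- read.csv(data_path, stringsAsFactors = FALSE)
```

Because the raw data are updated over time, it may also help to note the download date in the README and to save the output of `sessionInfo()` with the results.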
Tips:
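To compute the number of new cases each day, use the lag function from the dplyr package. In this toy example, cases records the cumulative number of cases over a two-week period. By default, lag shifts the vector by one position, so that each entry holds the previous day's value and the first entry is NA. The number of new cases on each day is then the difference between cases and lag(cases). Note that base R's stats::lag does not shift the values of a plain vector in this way, so dplyr must be loaded first.
library(dplyr)
cases = c(13, 15, 18, 22, 29, 39, 59, 61, 62, 67, 74, 89, 108, 122)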
new_cases = cases - lag(cases)
new_cases
## [1] NA 2 3 4 7 10 20 2 1 5 7 15 19 14
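To compute the seven-day rolling average, the zoo package already provides the rollmean function. Below, the k = 7 argument tells the function to use a rolling window of seven entries. By default the window is centered on each day (align = "center"), so it covers the three days before and the three days after. fill = NA tells rollmean to return NA for days where the seven-day window can't be filled (e.g. on the first day, there are no days that come before, so the window can't cover seven days); here the fourth entry is also NA because its window includes the NA at the start of new_cases. That way, new_cases_7dayavg will be the same length as cases and new_cases, which comes in handy if they all belong to the same data frame. For an average over the current day and the six preceding days, pass align = "right" (or use rollmeanr).
library(zoo)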
new_cases_7dayavg = rollmean(new_cases, k = 7, fill = NA)
new_cases_7dayavg
## [1] NA NA NA NA 6.857143 6.714286 7.000000 7.428571
## [9] 8.571429 9.857143 9.000000 NA NA NA
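The two tips can be combined inside a data frame. The following is a minimal sketch rather than a prescribed solution. It assumes dplyr and zoo are installed, uses a made-up two-county data frame (`toy`) with assumed column names `date`, `county`, and `cases`, and uses zoo's `rollmeanr` (the right-aligned variant of `rollmean`) so that each average looks backward in time. Whether a trailing window matches the published figures is an assumption you should check.

```r
library(dplyr)
library(zoo)

# Toy cumulative counts for two hypothetical counties (not real data).
toy <- data.frame(
  date   = rep(seq(as.Date("2022-01-01"), by = "day", length.out = 9), times = 2),
  county = rep(c("A", "B"), each = 9),
  cases  = c(10, 12, 15, 19, 26, 36, 56, 58, 59,
             0, 1, 1, 2, 4, 8, 16, 32, 64)
)

toy_avg <- toy %>%
  arrange(county, date) %>%
  group_by(county) %>%  # difference within each county, never across counties
  mutate(
    # Daily new cases; NA on each county's first day.
    new_cases = cases - dplyr::lag(cases),
    # Trailing seven-day window; fill = NA pads days without a full window.
    new_cases_7dayavg = rollmeanr(new_cases, k = 7, fill = NA)
  ) %>%
  ungroup()

print(as.data.frame(toy_avg))
```

Grouping before calling `lag` matters: without it, the first day of county B would be differenced against the last day of county A.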
Tasks:
1. Create the plot of new cases over time with a rolling average: the first plot on the page (you don’t need to recreate the colors or theme).
2. Create the table of cases, hospitalizations, and deaths: the first table on the page, right below the figure you created in task 1. You don’t need to include tests.
3. Create the county-level map for the previous week (‘Hot spots’): the second plot on the page (only the ‘Hot spots’ plot). You don’t need to include state names and can use a different color palette.
4. Create the table of cases by state: the second table on the page (you do not need to include the per 100,000, 14-day change, or fully vaccinated columns).
5. Provide a brief critique of the reproducibility of the figures and tables you created in tasks 1-4.
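For the state-level table in task 4, county-level cumulative counts first have to be summed to the state level. The following base R sketch shows one possible way to do that and to compute an average of daily new cases over the seven days ending on a target date. The helper name `state_daily_avg`, the toy data, and the column names `date`, `state`, `county`, and `cases` are assumptions for illustration. Which columns the published table reports, and whether its averages use exactly this window, are not confirmed here. The sketch also assumes every county has a row on both the start and end dates.

```r
# Hypothetical helper: average daily new cases per state over the `window`
# days ending on `target`, computed from county-level cumulative counts.
state_daily_avg <- function(counties, target, window = 7) {
  # Sum county-level cumulative counts to one row per state on date `d`.
  at_date <- function(d) {
    rows <- counties[counties$date == d, ]
    aggregate(cases ~ state, data = rows, FUN = sum)
  }
  first <- at_date(target - window)
  last  <- at_date(target)
  out <- merge(first, last, by = "state", suffixes = c("_start", "_end"))
  # The mean of `window` daily differences equals the total change / window.
  out$daily_avg <- (out$cases_end - out$cases_start) / window
  out[order(-out$daily_avg), c("state", "cases_end", "daily_avg")]
}

# Toy cumulative counts for three hypothetical counties (not real data).
dates <- seq(as.Date("2022-01-05"), as.Date("2022-01-12"), by = "day")
toy_counties <- data.frame(
  date   = rep(dates, times = 3),
  state  = rep(c("X", "X", "Y"), each = length(dates)),
  county = rep(c("X1", "X2", "Y1"), each = length(dates)),
  cases  = c(100 + 10 * 0:7, 50 + 4 * 0:7, 200 + 21 * 0:7)
)

state_table <- state_daily_avg(toy_counties, as.Date("2022-01-12"))
print(state_table)
```

In the toy data, state X gains 10 + 4 = 14 cases per day and state Y gains 21, so Y is listed first. With real data, counties with missing or revised reports would break the assumption above and would be worth mentioning in the reproducibility critique.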